Syntactic indexing: enqueuer and scheduler #62485

Merged on May 31, 2024 (56 commits)

@keynmol keynmol commented May 7, 2024

Fixes GRAPH-124
Fixes GRAPH-121

This PR introduces three main components:

  • Enqueuer – a thin layer responsible for actually inserting the syntactic indexing records into the database.
  • Scheduler – a service that:
    • Identifies repositories that haven't been processed in a while
    • Identifies policies that match those repositories (policies that have syntactic indexing enabled)
    • Identifies commits that match any of the policies
    • And finally, enqueues the jobs to index the discovered repositories and commits
  • Scheduler job – a periodic routine that triggers the Scheduler at a specified interval. This job runs as part of the main Worker service and schedules jobs only if the experimental syntactic indexing feature is enabled.

Refactoring:

  • Making some methods public in policies.Service to make it easier to test logic that depends on glob matching of repository names (this matching requires separate state to be updated)
  • Extracting some test utilities into a separate package

TODO:

  • [ ] Tests for policy iterator - we're currently investigating whether policy iterator is needed at all, as pagination of policies is in question.
  • Tests for Scheduler
  • Wire in experimental feature flag

Test plan

  • New tests for all components

@cla-bot cla-bot bot added the cla-signed label May 7, 2024
@github-actions github-actions bot added the team/graph and team/product-platform labels May 7, 2024
@keynmol keynmol force-pushed the syntactic-indexing-enqueuer branch from 11d5272 to 03414ce Compare May 13, 2024 11:11
@keynmol keynmol changed the title Syntactic indexing: enqueuer Syntactic indexing: enqueuer and scheduler May 20, 2024

Caution

License checking failed, please read: how to deal with third parties licensing.

@keynmol keynmol marked this pull request as ready for review May 23, 2024 11:30

@varungandhi-src varungandhi-src left a comment

Left some initial comments; will take a deeper look at more of the code shortly.

Resolved (outdated) review comments:

  • cmd/worker/shared/init/db/db.go (1)
  • internal/codeintel/syntactic_indexing/enqueuer.go (4)
  • internal/codeintel/syntactic_indexing/jobstore/store.go (3)

@varungandhi-src varungandhi-src left a comment

Suggestions

  1. (Strong) Let's avoid doing the repo check again inside the doubly nested loop.
  2. (Weak) Let's avoid doing the commit check multiple times.
  3. (Weak) Let's avoid revisiting the same commit multiple times per repo.
  4. (Weak) Let's simplify the env vars. We can make it more complicated later.
  5. (Strong) Please clarify the policies array elements in scheduler_test.go

Would prefer Strong suggestions to be addressed pre-merge.

Thanks for your patience. It's been a bit hard for me to wrap my head around all the logic here, so this took longer to review than anticipated.

Comment on lines 101 to 141
commitsToSchedule := make(map[api.RepoID]collections.Set[api.CommitID])
enqueueOptions := EnqueueOptions{force: false}

var allErrors errors.MultiError

for _, repoToIndex := range repos {
	repo, _ := s.RepoStore.Get(ctx, api.RepoID(repoToIndex.ID))
	policyIterator := internal.NewPolicyIterator(s.PoliciesService, repoToIndex.ID, internal.SyntacticIndexing, schedulerConfig.PolicyBatchSize)
	err := policyIterator.ForEachPoliciesBatch(ctx, func(policies []policiesshared.ConfigurationPolicy) error {
		commitMap, err := s.PolicyMatcher.CommitsDescribedByPolicy(ctx, int(repoToIndex.ID), repo.Name, policies, currentTime)

		if err != nil {
			return err
		}

		for commit, policyMatches := range commitMap {
			if len(policyMatches) == 0 {
				continue
			}
			if commits := commitsToSchedule[repo.ID]; commits != nil {
				commits.Add(api.CommitID(commit))
			} else {
				commitsToSchedule[repo.ID] = collections.NewSet(api.CommitID(commit))
			}
		}

		return nil
	})

	if err != nil {
		allErrors = errors.Append(allErrors, errors.Newf("Failed to discover commits eligible for syntactic indexing for repo [%s]: %v", repo.Name, err))
	}
}

for repoId, commits := range commitsToSchedule {
	for _, commitId := range commits.Values() {
		if _, err := s.Enqueuer.QueueIndexingJobs(ctx, repoId, commitId, enqueueOptions); err != nil {
			allErrors = errors.Append(allErrors, errors.Newf("Failed to schedule syntactic indexing of repo [ID=%s], commit [%s]: %v", repoId, commitId, err))
		}
	}
}

Contributor Author

@varungandhi-src I've re-done this part, main highlights:

  1. Collecting all repos and commits once
  2. Removing fail-fast behavior in repo iterator
  3. Removing fail-fast behavior in (repo, commit) iterator

Additionally, the enqueuer no longer performs revision checks; we'll leave that to the syntactic worker itself.

This means the scheduler will always do best-effort scheduling, while returning all accumulated errors.

I hope my understanding of return allErrors is right here and equivalent to return nil if no allErrors = errors.Append calls ever happened.

Contributor

Thanks, this looks much better!

> I hope my understanding of return allErrors is right here and equivalent to return nil if no allErrors = errors.Append calls ever happened.

Yep, we're following this pattern elsewhere too. I think you don't even need to declare the type as MultiError; you can see other places where errors.Append takes its first argument as a plain error instead of a MultiError (search for var errs error).

keynmol commented May 31, 2024

I will merge this PR; any other serious issues can be addressed in subsequent PRs, given that this component is disabled by default.

@keynmol keynmol merged commit 2821447 into main May 31, 2024
11 checks passed
@keynmol keynmol deleted the syntactic-indexing-enqueuer branch May 31, 2024 09:25